1 Introduction

This work is part of the SHIVA (Secured Hardware Immune Versatile
Architecture) project, whose purpose is to provide a programmable and
reconfigurable hardware module with a high level of security. We propose a
recursive double-size fixed-precision arithmetic called RecInt. Our work can be
split into two parts. First, we developed a C++ software library whose
performance is comparable to that of GMP. Second, our simple representation of
integers allows an implementation on FPGA.

Concerning the software part, we observed that the general-purpose
arbitrary-precision GMP library is most often faster for cryptographic routines
than special-purpose libraries such as OpenSSL or Miracl. We then found that GMP
could be improved at very small precisions. Our idea is to consider sizes that
are a power of 2 and to apply doubling techniques to implement them efficiently:
we design a recursive data structure where integers of size $2^k$, for
$k > k_0$, are stored as two integers of size $2^{k-1}$. For $k \le k_0$ we use
machine arithmetic instead ($k_0$ depending on the architecture). Our design
makes use of the C++ template mechanism, so that we can define a generic
doubling structure for large $k$ and specialize it to machine arithmetic for
small values of $k$. If some routines can be implemented faster for specific
sizes, the template mechanism also allows partial specializations of these
routines.

We provide a prototype implementation showing good performance on desktop
PCs: the PALOALTO library (https://www.ljkforge.imag.fr/projects/paloalto/).

Concerning the hardware part, our first results are based on the
transformation of C++ sources to VHDL using dedicated software. We show that
these first results are promising.

2 Recursive data structure



The main idea is to represent an element of size $2^k$ with two elements of
size $2^{k-1}$. We denote by RecInt<k> our recursive integer of size $2^k$ bits.
This leads to a recursive structure with specializations for small sizes:

$$
\begin{cases}
\texttt{RecInt<k>}\ a \mapsto (\texttt{RecInt<k-1>}\ a.\mathrm{High},\ \texttt{RecInt<k-1>}\ a.\mathrm{Low}) \\
\texttt{RecInt<}k_0\texttt{>}\ \text{of size } 2^{k_0} \text{, called limb,}
\end{cases}
$$

where $a = a.\mathrm{High} \cdot 2^{2^{k-1}} + a.\mathrm{Low}$.

All these integers are unsigned, hence a RecInt<k> can store any unsigned
integer within the range $0..2^{2^k}-1$.




Figure 1: Recursive structure of a RecInt<8> on a 64-bit architecture.



3 Operations 

Obviously all the classical arithmetic operations are provided. However we focus 
on specific ones. 

3.1 Extending the word size 

The idea is to provide a very fast arithmetic for integers whose size is a
power of two, mimicking the behaviour of word-size arithmetic: operations are
correct modulo some $2^{2^k}$. Indeed, computing the remainder with such a
modulus on a binary architecture amounts to keeping the correct number of bits.
For instance, machine-size arithmetic is done modulo $2^{32}$ on a 32-bit
architecture, and modulo $2^{64}$ on a 64-bit architecture.

For truncated addition and subtraction, this amounts to discarding the carry
(or the borrow).

When multiplying two integers of at most $2^k$ bits modulo $2^{2^k}$, one does
not need to compute the highest bits of the product, as shown in the following
figure.

Figure 2: Truncated multiplication

Hence, one truncated multiplication of level $k$ requires only 1 complete
multiplication and 2 truncated multiplications of level $k-1$ (instead of 4
complete multiplications for a naive complete multiplication).
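
The count above can be checked on a small case. The following is a minimal sketch, not the library's code: a 64-bit word plays the role of level $k$ and its 32-bit halves play the role of level $k-1$. All function names are hypothetical and chosen for illustration only.

```cpp
#include <cstdint>

// Hypothetical helpers (not part of the library described in the text).

// Complete product of two half-size words: the full 64-bit result is kept.
inline std::uint64_t mul_complete32(std::uint32_t x, std::uint32_t y) {
    return static_cast<std::uint64_t>(x) * y;
}

// Truncated product of two half-size words: result modulo 2^32.
inline std::uint32_t mul_truncated32(std::uint32_t x, std::uint32_t y) {
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(x) * y);
}

// Truncated product at the doubled size: a * b modulo 2^64, computed with
// 1 complete and 2 truncated half-size multiplications.
inline std::uint64_t mul_truncated64(std::uint64_t a, std::uint64_t b) {
    const std::uint32_t al = static_cast<std::uint32_t>(a);
    const std::uint32_t ah = static_cast<std::uint32_t>(a >> 32);
    const std::uint32_t bl = static_cast<std::uint32_t>(b);
    const std::uint32_t bh = static_cast<std::uint32_t>(b >> 32);

    // Low * Low needs all of its bits.
    const std::uint64_t low = mul_complete32(al, bl);
    // The cross terms are shifted by 32 bits, so only their low halves matter.
    const std::uint32_t cross = static_cast<std::uint32_t>(
        mul_truncated32(ah, bl) + mul_truncated32(al, bh));
    // High * High is shifted by 64 bits and vanishes modulo 2^64.
    return low + (static_cast<std::uint64_t>(cross) << 32);
}
```

The product of the two high halves is never computed, which is where the saving over the four multiplications of the naive scheme comes from.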



3.2 Recursive division 

We use a recursive method for Euclidean division of integers described in [1].
This method uses two sub-algorithms dividing respectively 2 digits by 1 digit
and 3 halves by 2. They then allow a recursive division of an $s$-digit integer
by an $r$-digit integer with complexity $O(r s^{\log(3)-1} + r \log(s))$.

3.3 Montgomery modular multiplication 

A naive way of performing a modular multiplication is to perform a complete
multiplication followed by a modular reduction. However, in the general case,
this reduction is done with a division, which is time consuming. When there
exists a radix $R$ such that computations modulo $R$ are inexpensive,
Montgomery gives in [5] a method for performing a modular multiplication
without trial division. The complete multiplication is still performed, but
the reduction is done as follows:
function REDC(T)

    m ← (T mod R) N' mod R
    t ← (T + mN)/R

    if t ≥ N then return t − N else return t.

where $0 < R^{-1} < N$ and $0 < N' < R$ satisfy $R R^{-1} - N N' = 1$.

This algorithm computes $\mathrm{REDC}(T) = T R^{-1} \bmod N$ if $0 \le T < RN$.
Thus $\mathrm{REDC}((AR \bmod N)(BR \bmod N)) = ABR \bmod N$. This means that if
we write $\tilde{A} = AR \bmod N$ and $\tilde{B} = BR \bmod N$ (called the
Montgomery representations of $A$ and $B$), then
$\mathrm{REDC}(\tilde{A}\tilde{B}) = \widetilde{AB}$.

Hence, as long as we stay in Montgomery representation, no division is
required. In order to come back to the regular representation, one needs to
perform a multiplication by $R^{-1}$ modulo $N$.

In our case, if one wants to multiply two RecInt<k> $A$ and $B$, we take
$R = 2^{2^k}$. If $T$ is a RecInt<k+1> (such as their complete product):

- Reduction of $T$ modulo $R$ amounts to taking T.Low.

- Exact division of $T$ by $R$ amounts to taking T.High.

Hence REDC requires only 1 truncated multiplication and 1 complete
multiplication.
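
As an illustration of REDC with a power-of-two radix, here is a minimal sketch on machine words, assuming $R = 2^{32}$ and an odd modulus $N < 2^{31}$ (so that every intermediate value fits in 64 bits). The function names are hypothetical and are not taken from the library; the computation of $N'$ by Newton iteration is one standard choice, not necessarily the one used by the authors.

```cpp
#include <cstdint>

// Hypothetical sketch, assuming R = 2^32 and an odd modulus N < 2^31.

// N' = -N^{-1} mod 2^32, so that R*R^{-1} - N*N' = 1.
inline std::uint32_t neg_inverse(std::uint32_t N) {
    std::uint32_t inv = N;           // N*N = 1 mod 8: correct to 3 bits
    for (int i = 0; i < 4; ++i)      // precision doubles: 6, 12, 24, 48 bits
        inv *= 2u - N * inv;
    return 0u - inv;
}

// REDC(T) = T * R^{-1} mod N, for 0 <= T < R*N.
inline std::uint32_t redc(std::uint64_t T, std::uint32_t N,
                          std::uint32_t Nprime) {
    // (T mod R) * N' mod R: a truncated multiplication on the low half.
    const std::uint32_t m = static_cast<std::uint32_t>(T) * Nprime;
    // T + m*N is a multiple of R: the exact division keeps the high half.
    const std::uint64_t t = (T + static_cast<std::uint64_t>(m) * N) >> 32;
    return static_cast<std::uint32_t>(t >= N ? t - N : t);
}

// Conversion to Montgomery representation: a * R mod N.
inline std::uint32_t to_mont(std::uint32_t a, std::uint32_t N) {
    return static_cast<std::uint32_t>((static_cast<std::uint64_t>(a) << 32) % N);
}

// Product of two values already in Montgomery representation.
inline std::uint32_t mont_mul(std::uint32_t x, std::uint32_t y,
                              std::uint32_t N, std::uint32_t Nprime) {
    return redc(static_cast<std::uint64_t>(x) * y, N, Nprime);
}
```

Calling `redc` on a single Montgomery value multiplies it by $R^{-1}$ and therefore converts it back to the regular representation.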

4 C++ Library 

This library depends on GMP (for machine word arithmetic). Hence, if GMP is
installed on your computer, you can use our library by typing at the beginning
of your C++ program:

#define 32bits // or 64bits depending on your processor 

#include "recint.h" 

4.1 Template recursive data structure 

In order to simplify the development, we chose to use a template recursive data
structure with partial specialization. This specialization depends on the
architecture:

- RecInt<5> ≡ uint32 on a 32-bit architecture.

- RecInt<6> ≡ uint64 on a 64-bit architecture.

First, our library allows manipulation of fixed-precision integers. Thus we
define the template structure RecInt<> as follows:
template <size_t k> struct RecInt {
    typedef RecInt<k+1> Father_t;
    typedef RecInt<k>   Self_t;
    typedef RecInt<k-1> Half_t;

    // High = most significant part
    // Low  = least significant part
    // *this == High * 2^(2^(k-1)) + Low
    Half_t High, Low;
};

template <> struct RecInt<LIMB_SIZE> {
    limb Value;
};

where limb represents the machine word. 
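
To make the recursion concrete, the following sketch instantiates the structure on a 64-bit architecture. It assumes that `limb` is a 64-bit unsigned integer and that `LIMB_SIZE` equals 6; these two definitions, as well as the helpers `limb_count` and `get_limb`, are hypothetical and only serve the illustration.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint64_t limb;          // assumption: 64-bit machine word
const std::size_t LIMB_SIZE = 6;     // assumption: 2^6 = 64 bits per limb

template <std::size_t k> struct RecInt {
    typedef RecInt<k - 1> Half_t;
    Half_t High, Low;                // *this == High * 2^(2^(k-1)) + Low
};

template <> struct RecInt<LIMB_SIZE> {
    limb Value;
};

// Number of limbs in a RecInt<k>: 2^(k - LIMB_SIZE).
template <std::size_t k> struct limb_count {
    static constexpr std::size_t value = 2 * limb_count<k - 1>::value;
};
template <> struct limb_count<LIMB_SIZE> {
    static constexpr std::size_t value = 1;
};

// Base case: a single limb, the index is necessarily 0.
inline limb get_limb(const RecInt<LIMB_SIZE>& a, std::size_t) {
    return a.Value;
}

// Recursive case: limb 0 is the least significant one, so indices below
// half the limb count live in Low and the others in High.
template <std::size_t k>
limb get_limb(const RecInt<k>& a, std::size_t n) {
    const std::size_t half = limb_count<k - 1>::value;
    return n < half ? get_limb(a.Low, n) : get_limb(a.High, n - half);
}

// A RecInt<8> holds 256 bits, that is, four 64-bit limbs with no overhead.
static_assert(sizeof(RecInt<8>) == 4 * sizeof(limb), "unexpected layout");
```

Since the whole recursion is resolved at compile time, a RecInt<k> is laid out as a flat array of limbs.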

4.2 Classical operations 

Function parameters are passed by reference in order to avoid copying them at
each call.

Functions that return a boolean are always of the following form: 

template <class T> bool RI_is_equal_to (const T&, const T&) ; 

All the other functions are void functions whose first parameters are the output 
values and the last ones, considered as const are the input ones: 

template <class T> void RI_add(limb&, T&, const T&, const T&) ; 

Classic functions are split into the following sections according to their use: 

4.2.1 Operators 

We define the following operators for basic arithmetic. 

- Arithmetic operators: +, -, *;

- In-place operators: +=, -=, *=;

- Increment and decrement operators: ++, --;

- Comparison operators: ==, !=, <, >;
4.2.2 Comparison functions

Comparison between RecInt:

- char RI_comp(const RecInt<k>& a, const RecInt<k>& b) returns +1
  if a>b, 0 if a==b and -1 if a<b.

- bool RI_is_equal_to(const RecInt<k>& a, const RecInt<k>& b) returns
  true if and only if a==b.

- bool RI_is_greater_than(const RecInt<k>& a, const RecInt<k>& b)
  returns true if and only if a>b.

- bool RI_is_lower_than(const RecInt<k>& a, const RecInt<k>& b)
  returns true if and only if a<b.

Comparison with a constant:

- bool RI_is_equal_to_0(const RecInt<k>& a) returns true if and only
  if a==0.

- bool RI_is_equal_to_1(const RecInt<k>& a) returns true if and only
  if a==1.

- bool RI_is_equal_to_limb(const RecInt<k>& a, const limb& b) returns
  true if and only if a==b.

4.2.3 Set functions 

We provide a set of functions to set a part of an integer or the whole
integer to a specified value.

- void RI_reset(RecInt<k>& a) resets a to 0.

- void RI_random(RecInt<k>& a) sets a to a random value.

- void RI_set_limb(RecInt<k>& a, const limb& b, const unsigned int& n)
  sets the n-th limb of a to the value b (the 0-th limb is the least
  significant one).

- void RI_set_const(RecInt<k>& a, const limb& b) sets the 0-th limb
  of a to the value b.

4.2.4 Get functions 

The following functions get value(s) from an integer:

- void RI_get_limb(limb& l, const RecInt<k>& a, const unsigned int& n):
  l is set to the value of the n-th limb of a.

- void RI_get_limb0(limb& l, const RecInt<k>& a): l is set to the
  value of the least significant limb of a.

- void RI_get_limbn(limb& l, const RecInt<k>& a): l is set to the
  value of the most significant limb of a.

- void RI_copy(RecInt<k>& a, const RecInt<k>& b) copies the value
  of b into a.

4.3 Arithmetic functions 

The following classic operations always output the full precision of the
computation.

- void RI_add(limb& l, RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c):
  $a + l \cdot 2^{2^k} = b + c$ (l is the carry).

- void RI_add(limb& l, RecInt<k>& a, const RecInt<k>& b, const limb& c):
  $a + l \cdot 2^{2^k} = b + c$.

- void RI_increment(limb& l, RecInt<k>& a): $a \leftarrow a + 1$ modulo
  $2^{2^k}$, the carry being output in l.

- void RI_sub(limb& l, RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c):
  $a = b - c + l \cdot 2^{2^k}$ (l is the borrow).

- void RI_sub(limb& l, RecInt<k>& a, const RecInt<k>& b, const limb& c):
  $a = b - c + l \cdot 2^{2^k}$.

- void RI_decrement(limb& l, RecInt<k>& a): $a \leftarrow a - 1$ modulo
  $2^{2^k}$, the borrow being output in l.

- void RI_lmul(RecInt<k>& ah, RecInt<k>& al, const RecInt<k>& b, const RecInt<k>& c):
  $ah \cdot 2^{2^k} + al = b \cdot c$.

- void RI_lmul(limb& ah, RecInt<k>& al, const RecInt<k>& b, const limb& c):
  $ah \cdot 2^{2^k} + al = b \cdot c$.

- void RI_div(RecInt<k>& q, RecInt<k>& r, const RecInt<k>& a, const RecInt<k>& b):
  q and r are respectively the quotient and the remainder in the Euclidean
  division of a by b: $a = b \cdot q + r$ with $r < b$.

- void RI_div_quotient(RecInt<k>& q, const RecInt<k>& a, const RecInt<k>& b):
  only the quotient q is output.

- void RI_div_remainder(RecInt<k>& r, const RecInt<k>& a, const RecInt<k>& b):
  only the remainder r is output.

- void RI_square(RecInt<k>& ah, RecInt<k>& al, const RecInt<k>& b):
  $ah \cdot 2^{2^k} + al = b^2$.

- void RI_gcd(RecInt<k>& g, const RecInt<k>& a, const RecInt<k>& b):
  g is set to the gcd of a and b.
- void RI_ext_gcd(RecInt<k>& g, bool& su, RecInt<k>& u, bool& sv,
  RecInt<k>& v, const RecInt<k>& a, const RecInt<k>& b): g is set
  to the gcd of a and b, with Bézout coefficients u and v of respective
  signs su and sv (su==0 means that u is negative).



Addition and subtraction functions are provided with extended variants
having suffixes _in, _nc or the combined _nc_in.

The _in suffix means that the operation is made in place:
void RI_add_in(limb& l, RecInt<k>& a, const RecInt<k>& b): $a \leftarrow a + b$
and l is the carry.

The _nc suffix means that the carry (or borrow) is not output:
void RI_sub_nc(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c):
$a \leftarrow b - c$.
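
The calling convention (outputs first, carry as a separate limb) combines naturally with the recursive structure. The sketch below shows one possible recursive addition with carry and its carry-free variant; it assumes a 64-bit `limb` and `LIMB_SIZE` equal to 6, and the names `sketch_add_with_carry`, `sketch_add` and `sketch_add_nc` are hypothetical stand-ins, not the library's functions.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint64_t limb;          // assumption: 64-bit machine word
const std::size_t LIMB_SIZE = 6;     // assumption: 2^6 = 64 bits per limb

template <std::size_t k> struct RecInt { RecInt<k - 1> High, Low; };
template <> struct RecInt<LIMB_SIZE> { limb Value; };

// Base case: a + carry * 2^64 = b + c + carry_in, with carry_in in {0, 1}.
inline void sketch_add_with_carry(limb& carry, RecInt<LIMB_SIZE>& a,
                                  const RecInt<LIMB_SIZE>& b,
                                  const RecInt<LIMB_SIZE>& c, limb carry_in) {
    const limb s = b.Value + c.Value;        // wraps modulo 2^64
    const limb c1 = s < b.Value ? 1 : 0;     // overflow of b + c
    const limb r = s + carry_in;
    const limb c2 = r < s ? 1 : 0;           // overflow of the carry-in
    a.Value = r;
    carry = c1 | c2;                         // at most one of them is set
}

// Recursive case: add the low halves, then propagate the carry upwards.
template <std::size_t k>
void sketch_add_with_carry(limb& carry, RecInt<k>& a, const RecInt<k>& b,
                           const RecInt<k>& c, limb carry_in) {
    limb middle;
    sketch_add_with_carry(middle, a.Low, b.Low, c.Low, carry_in);
    sketch_add_with_carry(carry, a.High, b.High, c.High, middle);
}

// Full-precision addition: a + carry * 2^(2^k) = b + c.
template <std::size_t k>
void sketch_add(limb& carry, RecInt<k>& a, const RecInt<k>& b,
                const RecInt<k>& c) {
    sketch_add_with_carry(carry, a, b, c, 0);
}

// "No carry" variant: a = b + c modulo 2^(2^k), the carry is dropped.
template <std::size_t k>
void sketch_add_nc(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c) {
    limb ignored;
    sketch_add(ignored, a, b, c);
}
```

Because each limb of the inputs is read before the corresponding output limb is written, the same code also works when `a` aliases `b` or `c`, which is the situation of an in-place variant.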



4.4 Extending the word size 

The following functions return results modulo $2^{2^k}$:

- void RI_add_nc(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c)
  returns a such that $a = b + c$ modulo $2^{2^k}$.

- void RI_sub_nc(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c)
  returns a such that $a = b - c$ modulo $2^{2^k}$.

- void RI_mul(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c)
  returns a such that $a = b \cdot c$ modulo $2^{2^k}$.

4.5 Modular operations 

The PALOALTO Library allows manipulation of modular integers as well. 



4.5.1 Modular operations on Recint 

Our library provides modular operations on classical RecInt. The user must
specify the modulus n at each call of any operation. At the beginning of each
function, the input parameters are reduced modulo n. The outputs are also
computed modulo n.

- void RI_reduction(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& n):
  $a \leftarrow b \bmod n$.

- void RI_random_mod(RecInt<k>& a, const RecInt<k>& n): a is set to
  a random value within the range $0..n-1$.

- void RI_neg_mod(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& n):
  $a \leftarrow -b \bmod n$.






- void RI_add_mod(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c,
  const RecInt<k>& n): $a \leftarrow b + c \bmod n$.

- void RI_sub_mod(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c,
  const RecInt<k>& n): $a \leftarrow b - c \bmod n$.

- void RI_mul_mod(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c,
  const RecInt<k>& n): $a \leftarrow b \cdot c \bmod n$.

- void RI_mul_mod(RecInt<k>& a, const RecInt<k>& b, const limb& c,
  const RecInt<k>& n): $a \leftarrow b \cdot c \bmod n$.

- void RI_square_mod(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& n):
  $a \leftarrow b^2 \bmod n$.

- void RI_exp_mod(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c,
  const RecInt<k>& n): $a \leftarrow b^c \bmod n$.

- void RI_exp_mod(RecInt<k>& a, const RecInt<k>& b, const limb& c,
  const RecInt<k>& n): $a \leftarrow b^c \bmod n$.

- void RI_inv_mod(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& n):
  $a \leftarrow b^{-1} \bmod n$.

- void RI_div_mod(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& c,
  const RecInt<k>& n): $a \leftarrow b \cdot c^{-1} \bmod n$ if c is
  invertible modulo n.

- bool RI_is_quadratic_residue(const RecInt<k>& a, const RecInt<k>& n)
  returns true if and only if a is a quadratic residue modulo n.

- void RI_square_root_mod(RecInt<k>& a, const RecInt<k>& b, const RecInt<k>& n):
  a is such that $a^2 = b \bmod n$.

The neg, add and sub functions are extended with the suffix _in (in-place
operation) as for the classic operations. Note that the _nc suffix does not
make any sense in this situation.
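
As an illustration of the convention "inputs are reduced first, outputs are reduced", here is a sketch of a modular exponentiation on machine words by binary square-and-multiply. The source does not state which exponentiation algorithm the library uses, so this is only one standard possibility. The name `sketch_exp_mod` is hypothetical, and the sketch assumes $0 < n < 2^{32}$ so that all products fit in 64 bits.

```cpp
#include <cstdint>

// Hypothetical sketch: returns b^c mod n, assuming 0 < n < 2^32.
inline std::uint64_t sketch_exp_mod(std::uint64_t b, std::uint64_t c,
                                    std::uint64_t n) {
    std::uint64_t result = 1 % n;    // equals 0 when n == 1
    b %= n;                          // the input is reduced modulo n first
    while (c != 0) {
        if (c & 1)                   // current exponent bit is set
            result = (result * b) % n;
        b = (b * b) % n;             // square for the next exponent bit
        c >>= 1;
    }
    return result;                   // always within the range 0..n-1
}
```

With a recursive integer type, each product followed by `% n` would typically be replaced by a modular multiplication such as the Montgomery one of Section 3.3.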



4.5.2 RecIntMod type 

A special type is provided for modular operations. RecIntMod<> is basically a
RecInt<> provided with a modulus p declared as a static RecInt<> of the same
size. Furthermore, we guarantee that such elements are always reduced modulo
p. More precisely, we guarantee that all inputs and outputs of functions
manipulating RecIntMod<> are reduced modulo p (that is to say, within the range
$0..p-1$).

template <size_t k> struct RecIntMod {
    typedef RecInt<k> RecIntk;

    static RecIntk p;
    RecIntk Value;
};
Two functions allowing the conversion from RecIntMod<> to RecInt<> and
vice versa are available: Convert_to_RecInt(RecInt<k>&, RecIntMod<k>)
and Convert_to_RecIntMod(RecIntMod<k>&, RecInt<k>).



4.5.3 Modular operations on RecIntMod 

The same operations as in the previous section are applicable to RecIntMod<>
integers. However, the modulus p must be initialized before any arithmetic
operation with the following function:

void RI_init_module(const RecInt<k>& p)

Since the modulus is declared as static, one has to initialize it only once,
before any computation. If this is done, we guarantee that any RecIntMod<>
integer will always be reduced modulo p throughout the program. Note that the
user is allowed to change the modulus afterwards, but existing values are then
no longer guaranteed to be reduced modulo the new modulus.

The user can get the modulus back by using:
void RI_get_module(RecInt<k>& p)

Once the modulus has been initialized, the user can use the operations
presented in the section "Modular operations on RecInt", since they are
overloaded for use with RecIntMod.
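
The role of the static modulus can be sketched with machine words standing in for RecInt<k>. In the following, `SketchMod`, `sketch_init_module`, `sketch_get_module`, `sketch_convert` and `sketch_add_mod` are hypothetical names that only mirror the interface described above, and the sketch assumes $p < 2^{63}$ so that the sum of two reduced values cannot overflow.

```cpp
#include <cstdint>

// Hypothetical stand-in for RecIntMod<k>, with a 64-bit word as value type.
struct SketchMod {
    static std::uint64_t p;          // modulus shared by all values
    std::uint64_t Value;             // invariant: 0 <= Value < p
};
std::uint64_t SketchMod::p = 0;      // must be initialized before any use

// Counterpart of the initialization function: sets the shared modulus.
inline void sketch_init_module(std::uint64_t p) { SketchMod::p = p; }

// Counterpart of the getter: reads the shared modulus back.
inline void sketch_get_module(std::uint64_t& p) { p = SketchMod::p; }

// Conversion from a plain integer: this is where the reduction happens.
inline void sketch_convert(SketchMod& a, std::uint64_t x) {
    a.Value = x % SketchMod::p;
}

// a <- b + c mod p. No modulus argument is needed, and since b and c are
// already reduced, a single conditional subtraction is enough.
inline void sketch_add_mod(SketchMod& a, const SketchMod& b,
                           const SketchMod& c) {
    const std::uint64_t s = b.Value + c.Value;   // s < 2p, assuming p < 2^63
    a.Value = s >= SketchMod::p ? s - SketchMod::p : s;
}
```

This also shows why changing the modulus afterwards is unsafe: values created under the old modulus are not re-reduced, so the invariant `Value < p` may silently stop holding.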



5 C++ library performance

In order to evaluate the performance of the PALOALTO prototype, we used
GMP's assembly routines for double machine word arithmetic (e.g. umul_ppmm,
defined in the GMP [3] file longlong.h, which multiplies two integers and
generates their two-word product).

Using these assembly routines and the recursive data structure detailed above,
we obtained the performance shown in the following figures for fixed-precision
operations.

For these performance comparisons we use the GMPbench suite [4].
Figure 3: Fixed precision addition with RecInt versus GMP-5.0.1, gcc 4.4.0,
Xeon X5482, 3.2GHz, in millions of arithmetic operations per second.

Figure 4: Fixed precision complete and truncated multiplications with RecInt
versus GMP-5.0.1, gcc 4.4.0, Xeon X5482, 3.2GHz, in millions of arithmetic
operations per second.

Figure 5: Fixed precision modular multiplication with RecInt versus GMP-5.0.1,
gcc 4.4.0, Xeon X5482, 3.2GHz, in millions of arithmetic operations per second.

Figure 6: Fixed precision modular exponentiation with RecInt versus GMP-5.0.1,
gcc 4.4.0, Xeon X5482, 3.2GHz, in millions of arithmetic operations per second.

The step shape of the RecInt curves is explained by the fact that we use a
RecInt<k> for all integers whose size in bits is within the range
$2^{k-1}+1..2^k$.

Results obtained with Recint are comparable to those with GMP. However, 
Recint appears to be more efficient for small fixed precision. 

6 Towards FPGA implementation 

We present here the first attempts towards an implementation on a real FPGA.
Within the SHIVA project, we need to provide basic arithmetic modules to be
used in an RSA or elliptic-curve based encryption scheme. In order to build
these modules, since our C++ library is already written, we chose to use GAUT
[2], a dedicated software tool transforming C++ source into VHDL. The creation
of a VHDL program can be split into the following steps:

- Compilation of the C++ source and creation of the corresponding graph.

- Compilation of the library containing the needed operations.

- Synthesis of the VHDL program and estimation of performance.

Here are some simulations of a modular exponentiation on a Virtex 5. We
varied the output flow in order to check its effect on the required size on
the FPGA.
Figure 7: 128-bit words.

We notice that the required size can be significantly reduced if we accept a
lower output flow.

These results have been obtained without significant modifications of the
C++ source. Thus they are not optimal, but rather promising.

Further work will consist of optimizing the C++ source in order to make it
better suited to VHDL synthesis.